{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers\n",
    "\n",
    "FLAN-T5, released with the [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) paper, is an enhanced version of T5 that has been fine-tuned in a mixture of tasks, or simple words, a better T5 model in any aspect. FLAN-T5 outperforms T5 by double-digit improvements for the same number of parameters. Google has open sourced [5 checkpoints available on Hugging Face](https://huggingface.co/models?other=arxiv:2210.11416) ranging from 80M parameter up to 11B parameter.\n",
    "\n",
    "In a previous blog post, we already learned how to [“Fine-tune FLAN-T5 for chat & dialogue summarization”](https://www.philschmid.de/fine-tune-flan-t5) using [the base version (250M parameter)](https://huggingface.co/google/flan-t5-base) of the model. In this blog post, we look into how we can scale the training from the Base version to the [XL (3B)](https://huggingface.co/google/flan-t5-xl) or [XXL (11B)](https://huggingface.co/google/flan-t5-xxl). \n",
    "\n",
    "This means we will learn how to fine-tune FLAN-T5 XL & XXL using model parallelism, multiple GPUs, and [DeepSpeed ZeRO](https://www.deepspeed.ai/tutorials/zero/). \n",
    "\n",
    "in addition to the tutorial, we have run a series of experiments to help you choose the right hardware setup. You can find the details in the Results & Experiments section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# install git lfs for pushing artifacts\n",
    "!sudo apt install git-lfs\n",
    "# install torch with the correct cuda version, check nvcc --version\n",
    "!pip install torch --extra-index-url https://download.pytorch.org/whl/cu116 --upgrade\n",
    "# install Hugging Face Libraries\n",
    "!pip install \"transformers==4.26.0\" \"datasets==2.9.0\" \"accelerate==0.16.0\" \"evaluate==0.4.0\" --upgrade\n",
    "# install deepspeed and ninja for jit compilations of kernels\n",
    "!pip install \"deepspeed==0.8.0\" ninja --upgrade\n",
    "# install additional dependencies needed for training\n",
    "!pip install rouge-score nltk py7zr tensorboard"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Process dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to the [“Fine-tune FLAN-T5 for chat & dialogue summarization”](https://www.philschmid.de/fine-tune-flan-t5) we need to prepare a dataset to fine-tune our model. As mentioned in the beginning, we will fine-tune [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) on the [CNN Dailymail Dataset](https://huggingface.co/datasets/cnn_dailymail). The blog post is not going into detail about the dataset generation. If you want to learn the detailed steps check out the [previous post](https://www.philschmid.de/fine-tune-flan-t5). \n",
    "\n",
    "We define some parameters, which we use throughout the whole example, feel free to adjust it to your needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# experiment config\n",
    "model_id = \"google/flan-t5-xxl\" # Hugging Face Model Id\n",
    "dataset_id = \"cnn_dailymail\" # Hugging Face Dataset Id\n",
    "dataset_config = \"3.0.0\" # config/verison of the dataset\n",
    "save_dataset_path = \"data\" # local path to save processed dataset\n",
    "text_column = \"article\" # column of input text is\n",
    "summary_column = \"highlights\" # column of the output text \n",
    "# custom instruct prompt start\n",
    "prompt_template = f\"Summarize the following news article:\\n{{input}}\\nSummary:\\n\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared to the [previous example](https://www.philschmid.de/fine-tune-flan-t5), we are splitting the processing and training into two separate paths. This allows you to run the preprocessing outside of the GPU instance. We process (tokenize) the dataset and save it to disk and then load in our train script from disk again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from transformers import AutoTokenizer\n",
    "import numpy as np \n",
    "\n",
    "# Load dataset from the hub\n",
    "dataset = load_dataset(dataset_id,name=dataset_config)\n",
    "# Load tokenizer of FLAN-t5-base\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "\n",
    "print(f\"Train dataset size: {len(dataset['train'])}\")\n",
    "print(f\"Test dataset size: {len(dataset['test'])}\")\n",
    "\n",
    "# Train dataset size: 287113\n",
    "# Test dataset size: 11490"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We defined a `prompt_template` in our config, which we will use to construct an instruct prompt for better performance of our model. Our `prompt_template` has a “fixed” start and end, and our document is in the middle. This means we need to ensure that the “fixed” template parts + document are not exceeding the max length of the model. Therefore we calculate the max length of our document, which we will later use for padding and truncation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt length: 12\n",
      "Max input length: 500\n"
     ]
    }
   ],
   "source": [
    "prompt_length = len(tokenizer(prompt_template.format(input=\"\"))[\"input_ids\"])\n",
    "max_sample_length = tokenizer.model_max_length - prompt_length\n",
    "print(f\"Prompt length: {prompt_length}\")\n",
    "print(f\"Max input length: {max_sample_length}\")\n",
    "\n",
    "# Prompt length: 12\n",
    "# Max input length: 500"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We know now that our documents can be “500” tokens long to fit our `template_prompt` still correctly. In addition to our input, we need to understand better our “target” sequence length meaning and how long are the summarization ins our dataset. Therefore we iterate over the dataset and calculate the max input length (at max 500) and the max target length. (takes a few minutes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/json": {
       "ascii": false,
       "bar_format": null,
       "colour": null,
       "elapsed": 0.012465238571166992,
       "initial": 0,
       "n": 0,
       "ncols": null,
       "nrows": null,
       "postfix": null,
       "prefix": "",
       "rate": null,
       "total": 299,
       "unit": "ba",
       "unit_divisor": 1000,
       "unit_scale": false
      },
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "32577879b38640f898e798ea8f88a801",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/299 [00:00<?, ?ba/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Max source length: 500\n"
     ]
    },
    {
     "data": {
      "application/json": {
       "ascii": false,
       "bar_format": null,
       "colour": null,
       "elapsed": 0.011892318725585938,
       "initial": 0,
       "n": 0,
       "ncols": null,
       "nrows": null,
       "postfix": null,
       "prefix": "",
       "rate": null,
       "total": 299,
       "unit": "ba",
       "unit_divisor": 1000,
       "unit_scale": false
      },
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "724cc7afe0ba49a3b8a6a763a189e380",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/299 [00:00<?, ?ba/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Max target length: 129\n"
     ]
    }
   ],
   "source": [
    "from datasets import concatenate_datasets\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "# The maximum total input sequence length after tokenization. \n",
    "# Sequences longer than this will be truncated, sequences shorter will be padded.\n",
    "tokenized_inputs = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])\n",
    "max_source_length = max([len(x) for x in tokenized_inputs[\"input_ids\"]])\n",
    "max_source_length = min(max_source_length, max_sample_length)\n",
    "print(f\"Max source length: {max_source_length}\")\n",
    "\n",
    "# The maximum total sequence length for target text after tokenization. \n",
    "# Sequences longer than this will be truncated, sequences shorter will be padded.\"\n",
    "tokenized_targets = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])\n",
    "target_lenghts = [len(x) for x in tokenized_targets[\"input_ids\"]]\n",
    "# use 95th percentile as max target length\n",
    "max_target_length = int(np.percentile(target_lenghts, 95))\n",
    "print(f\"Max target length: {max_target_length}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have everything needed to process our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "def preprocess_function(sample, padding=\"max_length\"):\n",
    "    # created prompted input\n",
    "    inputs = [prompt_template.format(input=item) for item in sample[text_column]]\n",
    "\n",
    "    # tokenize inputs\n",
    "    model_inputs = tokenizer(inputs, max_length=tokenizer.model_max_length, padding=padding, truncation=True)\n",
    "\n",
    "    # Tokenize targets with the `text_target` keyword argument\n",
    "    labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)\n",
    "\n",
    "    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore\n",
    "    # padding in the loss.\n",
    "    if padding == \"max_length\":\n",
    "        labels[\"input_ids\"] = [\n",
    "            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels[\"input_ids\"]\n",
    "        ]\n",
    "\n",
    "    model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
    "    return model_inputs\n",
    "\n",
    "# process dataset\n",
    "tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset[\"train\"].features))\n",
    "\n",
    "# save dataset to disk\n",
    "tokenized_dataset[\"train\"].save_to_disk(os.path.join(save_dataset_path,\"train\"))\n",
    "tokenized_dataset[\"test\"].save_to_disk(os.path.join(save_dataset_path,\"eval\"))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-tune model using `deepspeed`\n",
    "\n",
    "Done! We can now start training our model! We learned in the introduction that we would leverage the DeepSpeed integration with the Hugging Face Trainer. Therefore we need to create a `deespeed_config.json`. In the [DeepSpeed Configuration,](https://www.deepspeed.ai/docs/config-json/) we define the ZeRO strategy we want to use and if we want to use mixed precision training. The Hugging Face Trainer allows us to inherit values from the `TrainingArguments` in our `deepspeed_config.json` to avoid duplicate values, check the [documentation for more information.](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/deepspeed#configuration)\n",
    "\n",
    "We created 4 deepspeed configurations for the experiments we ran, including `CPU offloading` and `mixed precision`: \n",
    "\n",
    "- [ds_flan_t5_z3_config.json](./configs/ds_flan_t5_z3_config.json)\n",
    "- [ds_flan_t5_z3_config_bf16.json](./configs/ds_flan_t5_z3_config_bf16.json)\n",
    "- [ds_flan_t5_z3_offload.json](./configs/ds_flan_t5_z3_offload.json)\n",
    "- [ds_flan_t5_z3_offload_bf16.json](./configs/ds_flan_t5_z3_offload_bf16.json)\n",
    "\n",
    "Depending on your setup, you can use those, e.g. if you are running on NVIDIA V100s, you have to use the config without `bf16` since V100 are not support `bfloat16` types. \n",
    "\n",
    "> When fine-tuning `T5` models we cannot use `fp16` since it leads to overflow issues, see: [#4586](https://github.com/huggingface/transformers/issues/4586), [#10830](https://github.com/huggingface/transformers/issues/10830), [#10956](https://github.com/huggingface/transformers/pull/10956)\n",
    "> \n",
    "\n",
    "As mentioned in the beginning, we are using a p4dn.24xlarge AWS EC2 Instance including 8x NVIDIA A100 40GB. This means we can leverage `bf16`, which reduces the memory footprint of the model by almost ~2x, which allows us to train without offloading efficiently. \n",
    "\n",
    "We are going to use the [ds_flan_t5_z3_config_bf16.json](./configs/ds_flan_t5_z3_config_bf16.json). If you are irritated by the `auto` values, check the [documentation](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/deepspeed#configuration)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
      "deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py --model_id google/flan-t5-xxl --dataset_path data --epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --generation_max_length 129 --lr 1e-4 --deepspeed configs/ds_flan_t5_z3_config_bf16.json\n"
     ]
    }
   ],
   "source": [
    "!deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py \\\n",
    "    --model_id $model_id \\\n",
    "    --dataset_path $save_dataset_path \\\n",
    "    --epochs 3 \\\n",
    "    --per_device_train_batch_size 8 \\\n",
    "    --per_device_eval_batch_size 8 \\\n",
    "    --generation_max_length $max_target_length \\\n",
    "    --lr 1e-4 \\\n",
    "    --deepspeed configs/ds_flan_t5_z3_config_bf16.json "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Results\n",
    "\n",
    "dataset: `\"cnn_dailymail\"`\n",
    "-  training examples: `287113`\n",
    "-  validation examples: `13368`\n",
    "\n",
    "hyperparamters:\n",
    "- epochs: `3`\n",
    "- learning rate: `1e-4`\n",
    "\n",
    "setups: \n",
    "- 4x V100 16GB: p3.8xlarge\n",
    "- 4x A10G 24GB: g5.24xlarge\n",
    "- 8x V100 16GB: p3.16xlarge\n",
    "- 8x A100 40GB: p4dn.24xlarge\n",
    "\n",
    "\n",
    "| Model             | DS offload | Hardware     | batch size per GPU | precision | duration | cost   |\n",
    "|-------------------|------------|--------------|--------------------|-----------|----------|--------|\n",
    "| FLAN-T5-XL (3B)   | No         | 4x V100 16GB | OOM                | fp32      | -        | -      |\n",
    "| FLAN-T5-XL (3B)   | No         | 8x V100 16GB | 1                  | fp32      | 105h     | ~$2570 |\n",
    "| FLAN-T5-XL (3B)   | No         | 8x A100 40GB | 72                 | bf16     |   2,5h       | ~$81       |\n",
    "| FLAN-T5-XL (3B)   | Yes        | 4x V100 16GB | 8                  | fp32      | 69h      | ~$828  |\n",
    "| FLAN-T5-XL (3B)   | Yes        | 8x V100 16GB | 8                  | fp32      | 32h      | ~$768  |\n",
    "| FLAN-T5-XXL (11B) | No        | 8x A100 40GB | 8                | bf16      | 10h        | ~$322      |\n",
    "| FLAN-T5-XXL (11B) | Yes        | 4x V100 16GB | OOM                | fp32      | -        | -      |\n",
    "| FLAN-T5-XXL (11B) | Yes        | 8x V100 16GB | OOM                | fp32      | -        | -      |\n",
    "| FLAN-T5-XXL (11B) | Yes        | 4x A10G 24GB | 24                | bf16      | 90h      | ~$732  |\n",
    "| FLAN-T5-XXL (11B) | Yes        | 8x A100 40GB | 48                | bf16      | 19h      | ~$613  |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pytorch",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
